We have data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They performed barbell lifts correctly and incorrectly in 5 different ways. Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. The training dataset is taken from here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Our goal is to create a model and predict the class of the test exercises.
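The training data can be read directly from the URL above. A minimal sketch (the testing-set URL is an assumption, inferred from the same naming pattern as the training URL):

```r
# Read the training and testing sets straight from the course URLs.
# NOTE: the pml-testing.csv URL is an assumed analogue of the
# training URL given above, not stated in this report.
urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
dfPMLTraining <- read.csv(urlTrain)
dfPMLTesting  <- read.csv(urlTest)
```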

1. Getting and cleaning data

There are many variables in the training dataset, which makes it difficult to create a model:

  dim(dfPMLTraining)
## [1] 19622   160

Let’s find and remove the unusable variables.
We can see that the participants performed the exercises sequentially, one after another, and that all participants performed exercises of all classes (A, B, C, D, E). The variable X is just the row number, and the user name, timestamps, and window variables only describe this ordering, not the movements themselves.

So these 7 variables are useless for building a model, and we can remove them:

  dfPMLTrainingEx <- subset(dfPMLTraining,
    select = c(-X, -user_name,
               -raw_timestamp_part_1,
               -raw_timestamp_part_2,
               -cvtd_timestamp,
               -new_window, -num_window))

There are variables in the dataset that consist mostly of NA or blank (equal to "") values.

  # Flag columns where less than 90% of the values are NA or blank
  vNotEmptyColumns <- sapply(dfPMLTrainingEx,
                             function(x) {
                               sum(is.na(x) | x == "") < 0.9 * nrow(dfPMLTrainingEx)
                             })
  table(vNotEmptyColumns)
## vNotEmptyColumns
## FALSE  TRUE 
##   100    53

We can remove these 100 variables too.

  dfPMLTrainingEx <- dfPMLTrainingEx[,vNotEmptyColumns]
  dim(dfPMLTrainingEx)
## [1] 19622    53

Now we have 53 variables to create the model with, instead of 160 in the original dataset.

2. Creating and comparing models

Let’s split the dataset (75% for training and 25% for validation), create several models on the training partition, and test them on the validation partition.

  library(caret)  # provides createDataPartition, train, confusionMatrix

  # Split the data: 75% for training, 25% for validation
  inTraining <- createDataPartition(y = dfPMLTrainingEx$classe, p = .75, list = FALSE)
  dfPMLTrainingExTr <- dfPMLTrainingEx[inTraining, ]
  dfPMLTrainingExTst <- dfPMLTrainingEx[-inTraining, ]
  rm(dfPMLTraining, dfPMLTrainingEx)
  gc()

  # Random forest
  fitRF <- train(classe ~ ., method = "rf", data = dfPMLTrainingExTr)
  predRF <- predict(fitRF, dfPMLTrainingExTst)
  confMatrRF <- confusionMatrix(predRF, dfPMLTrainingExTst$classe)
  strLabelRF <- fitRF$modelInfo$label
  strAccRF <- confMatrRF$overall[1]
  rm(predRF, confMatrRF)  # keep fitRF for the final prediction
  gc()

  # CART decision tree
  fitTR <- train(classe ~ ., method = "rpart", data = dfPMLTrainingExTr)
  predTR <- predict(fitTR, dfPMLTrainingExTst)
  confMatrTR <- confusionMatrix(predTR, dfPMLTrainingExTst$classe)
  strLabelTR <- fitTR$modelInfo$label
  strAccTR <- confMatrTR$overall[1]
  rm(fitTR, predTR, confMatrTR)
  gc()

  # Stochastic gradient boosting
  fitBS <- train(classe ~ ., method = "gbm", data = dfPMLTrainingExTr, verbose = FALSE)
  predBS <- predict(fitBS, dfPMLTrainingExTst)
  confMatrBS <- confusionMatrix(predBS, dfPMLTrainingExTst$classe)
  strLabelBS <- fitBS$modelInfo$label
  strAccBS <- confMatrBS$overall[1]
  rm(fitBS, predBS, confMatrBS)
  gc()

  # Linear discriminant analysis
  fitLDA <- train(classe ~ ., method = "lda", data = dfPMLTrainingExTr)
  predLDA <- predict(fitLDA, dfPMLTrainingExTst)
  confMatrLDA <- confusionMatrix(predLDA, dfPMLTrainingExTst$classe)
  strLabelLDA <- fitLDA$modelInfo$label
  strAccLDA <- confMatrLDA$overall[1]
  rm(fitLDA, predLDA, confMatrLDA)
  gc()

Now let’s compare the accuracy of the predictions.
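The comparison table below can be assembled from the labels and accuracies saved above. A sketch (this assembly step is an assumption, since the original code for it is not shown):

```r
# Collect model names and validation accuracies into one data frame,
# using the strLabel*/strAcc* values computed in the previous section.
dfAccuracy <- data.frame(
  MethodName = c(strLabelRF, strLabelTR, strLabelBS, strLabelLDA),
  Accuracy   = c(strAccRF, strAccTR, strAccBS, strAccLDA)
)
dfAccuracy
```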

##                     MethodName  Accuracy
## 1                Random Forest 0.9951060
## 2                         CART 0.4951060
## 3 Stochastic Gradient Boosting 0.9692088
## 4 Linear Discriminant Analysis 0.7014682

The best result is given by the Random Forest method.
Now we can predict the class of the 20 test measurements.

  predRFfinal <- predict(fitRF, dfPMLTesting)
  predRFfinal
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E